knitr::opts_chunk$set(
eval = F,
echo = T,
message = F,
warning = F
)In [1]:
The Roster Compile script processes records from PHL, Template Submitters, fuzzy matching, for review, keep na, and ad hoc files to create the sequencing roster. This script imports the files that are ready to be rostered, performs QA checks, and outputs the files that are ready to be added to WDRS.
Setup
Libraries
In [2]:
library(lubridate)
library(tidyverse)
library(readxl)
library(openxlsx)
library(fs)
library(here)
library(sendmailR)
library(DBI)
library(odbc)Read in RDS objects
All of these are used within the roster_filters function for quality checks.
The lab_variables.rds file is created in Roster_scripts/write_lab_variables.R. It includes valid sequence reasons and lab names and is used across multiple scripts. Please refer to the tables on the GitHub wiki for updates.
The lineages.csv file contains all lineages, both active and withdrawn.
The valid years vector starts with 2020 and continues to the present year.
In [3]:
# read in r_creds.RDS
r_creds <-readRDS(file.path(Sys.getenv("USERPROFILE"), "Projects/Sequencing/Data_Objects", "r_creds.RDS"))
# Bring in object with sequence reasons and sequence laboratories
lab_vars <- readRDS("Data_Objects/lab_variables.rds")
# Bring in file with all lineages (active or withdrawn)
lineages <- read_csv("Data_Objects/Lineages/Lineages.csv",
col_names = TRUE,
col_types = cols(.default = "c"),
na = c("", "NA", "N/A"))
#Bring in GISAID_
wa_gisaid <- read_rds("GISAID Data/wa_gisaid.rds")
# Read in quality_filters
source(file.path(Sys.getenv("USERPROFILE"), "Projects/Sequencing/Roster_scripts/quality_filters.R"))
# Append Unassigned lineage to lineages list
df <- data.frame (lineage_extracted = "Unassigned",
description = NA,
status = NA
)
lineages <- bind_rows(lineages,df)
# Create vector of years
years <- year(seq(ymd("2020-01-01"), today(), by = "years"))
valid_years <- paste(years, collapse = "|")Connect to WDRS
IMPORTANT the variables used to connect to WDRS are held within conn_list.RDS. All .RDS objects in this repository except for VOC.RDS are excluded from Git commits by declaring *.RDS in the .gitignore file because they are often used to hold our “secrets” such as credential and server connections. We do not include server connections in code uploaded to GitHub. WHY? We have been asked by HTS to ensure our use of GitHub does not raise any security red flags. This server is an internal server containing confidential/restricted PHI. We want to hide this server information to reduce our possible “attack surface”. This connection may seem benign but it tells someone information they can use to “hack” into WDRS. SQL Server Native Client is now deprecated software and version 11 was the last release. Unsupported software is at higher risk of having security breaches. Additionally, someone would know the server name. So: DO NOT alter the code used to open the connection to WDRS in any way that creates a security risk. Continue to treat this connection as a secret and store its variables in a .RDS object (or other external object that is excluded from Git commits) rather than calling them directly here.
In [4]:
# connect
connection <- DBI::dbConnect(odbc::odbc(),
Driver = r_creds$conn_list[1],
Server = r_creds$conn_list[2],
Database = r_creds$conn_list[3],
Trusted_connection = r_creds$conn_list[4],
ApplicationIntent = r_creds$conn_list[5])WDRS Query
This queries the WDRS sequencing table for all SEQUENCE_ACCESSION and SEQUENCE_CLINICAL_ACCESSION values.
In [5]:
wdrs_seq <- dbGetQuery(connection, "
SELECT DISTINCT SEQUENCE_ACCESSION_NUMBER,
SEQUENCE_CLINICAL_ACCESSION_NUMBER
FROM [WDRS].[dbo].[DD_GCD_COVID19_SEQUENCING]
WHERE CASE_STATUS != 6
")WDRS Sequence Accession values
The list of SA values is used in the roster_filters function to check if a record’s SA already exists in WDRS.
In [6]:
# Omit any NA's
wdrs_sa_seq_clean <- wdrs_seq[['SEQUENCE_ACCESSION_NUMBER']][!is.na(wdrs_seq[['SEQUENCE_ACCESSION_NUMBER']])] %>%
# For fields that have "hCoV-19/" appended to the beginning of the SEQUENCE_ACCESSION remove it by str_replace() with ""
str_replace("hCoV-19/", "") %>%
# Trim off the white space resulting from str_split, this also gets rid of " " values
str_trim("both")
# Remove any values that are ""
wdrs_sa_seq_values <- wdrs_sa_seq_clean[wdrs_sa_seq_clean != ""]WDRS Sequence Clinical Accession values
The SEQUENCE_CLINICAL_ACCESSION values are split at each “,” to create a single vector of values. The list of SCA values is used in the roster_filters function to check if a record’s SCA already exists in WDRS.
In [7]:
# Omit any NA's
wdrs_sca_seq_clean <- wdrs_seq[['SEQUENCE_CLINICAL_ACCESSION_NUMBER']][!is.na(wdrs_seq[['SEQUENCE_CLINICAL_ACCESSION_NUMBER']])] %>%
# For fields that have "hCoV-19/" appended to the beginning of the SEQUENCE_ACCESSION remove it by str_replace() with ""
str_replace("hCoV-19/", "") %>%
# Trim off the white space resulting from str_split, this also gets rid of " " values
str_trim("both")
# Remove any values that are ""
wdrs_sca_seq_values <- wdrs_sca_seq_clean[wdrs_sca_seq_clean != ""]Compile records
Read in files from write_roster_here
Files that are ready to be compiled for the roster are saved to the write_roster_here folder. A new file with today’s date is created in the Archive folder. All files are read in and moved to the Archive folder for future reference. The index column is added when files are read in. The column includes the name of the file and can be used in troubleshooting. This column should not be included in the final file.
In [8]:
# List all rosters that were outputted to the write_roster_here folder
compile_files <- dir_ls("write_roster_here",type = "file", pattern = "*.csv")
# Initialize list
Compile_Rosters <- list()
if(length(compile_files) > 0){
# Create folder in Archive
dir_create(paste0(file.path("write_roster_here/Archive", today())))
# Files saved in vector format
roster_files <- dput(as.vector(compile_files))
# Set names
names(roster_files) <- roster_files
# Read in files, all columns as character
Compile_Rosters <- roster_files %>%
map_df(~read_csv(., col_types = cols(
CASE_ID = col_character(),
SEQUENCE_SGTF = col_character(),
SEQUENCE_SPECIMEN = col_character(),
SEQUENCE_REASON = col_character(),
SEQUENCE_DATE = col_character(),
SEQUENCE_LAB = col_character(),
SEQUENCE_STATUS = col_character(),
SEQUENCE_REPOSITORY = col_character(),
SEQUENCE_ACCESSION = col_character(),
SEQUENCE_VARIANT_OPEN_TEXT = col_character(),
SEQUENCE_CLINICAL_ACCESSION = col_character(),
SEQUENCE_SPECIMEN_COLLECTION_DATE = col_character(),
SEQUENCE_NOTES = col_character(),
SEQUENCE_REVIEWED = col_character(),
Case.Note = col_character()
),
# Read as missing
na = c("", "NA", "N/A", "None", "NONE")
),
# Add index column to store file path
.id = "index")
# Move files to archive
file_move(roster_files, paste0("write_roster_here/Archive/", today()))
}
# There are two versions of the roster columns in various places, remove EPI_ISL and Roster Prepared date from any file in write_roster_here
Compile_Rosters <- Compile_Rosters %>% select(-matches("SEQUENCE_EPI_ISL|SEQUENCE_ROSTER_PREPARE_DATE"))WDRS column order
The key_cols vector is the correct order of column names for WDRS upload. DO NOT CHANGE THESE! WDRS uploads the file based on this order, not on the column name. If the order is changed, the file will be uploaded to WDRS as is, and the WDRS data won’t be in the correct columns. The key_cols_index vector includes the roster columns and the index column and is used to remove extra columns from the input files.
In [9]:
# Columns that are included in roster (this is the correct order for WDRS upload)
key_cols = c(
"CASE_ID",
"SEQUENCE_SGTF",
"SEQUENCE_SPECIMEN",
"SEQUENCE_DATE",
"SEQUENCE_REASON",
"SEQUENCE_LAB",
"SEQUENCE_STATUS",
"SEQUENCE_REPOSITORY",
"SEQUENCE_ACCESSION",
"SEQUENCE_EPI_ISL",
"SEQUENCE_VARIANT_OPEN_TEXT",
"SEQUENCE_CLINICAL_ACCESSION",
"SEQUENCE_SPECIMEN_COLLECTION_DATE",
"SEQUENCE_ROSTER_PREPARE_DATE",
"SEQUENCE_NOTES",
"SEQUENCE_REVIEWED",
"Case.Note")
# Columns that are included in roster, plus index
key_cols_index = c(
"CASE_ID",
"SEQUENCE_SGTF",
"SEQUENCE_SPECIMEN",
"SEQUENCE_DATE",
"SEQUENCE_REASON",
"SEQUENCE_LAB",
"SEQUENCE_STATUS",
"SEQUENCE_REPOSITORY",
"SEQUENCE_ACCESSION",
"SEQUENCE_EPI_ISL",
"SEQUENCE_VARIANT_OPEN_TEXT",
"SEQUENCE_CLINICAL_ACCESSION",
"SEQUENCE_SPECIMEN_COLLECTION_DATE",
"SEQUENCE_ROSTER_PREPARE_DATE",
"SEQUENCE_NOTES",
"SEQUENCE_REVIEWED",
"Case.Note",
"index")Compile rosters, addition of SEQUENCE_EPI_ISL and SEQUENCE_ROSTER_PREPARE_DATE columns, cleansing
In [10]:
# Bind rosters
Compiled_Roster <- bind_rows(Compile_Rosters)
wa_gisaid <- as_tibble(wa_gisaid)
# Join the "Accession ID" from wa_gisaid to populate the new vars: "SEQUENCE_EPI_ISL". Create the new var SEQUENCE_ROSTER_PREPARE_DATE with today's date in mm/dd/yyyy format. Trim SEQUENCE_SPECIMEN_COLLECTION_DATE. Remove duplicates and index columns
Compiled_Roster_Add_Cols <- Compiled_Roster %>%
# join "virus_name_clean", "Accession ID", "gisaid_epi_isl" columns from wa_gisaid
left_join(wa_gisaid[c("virus_name_clean", "Accession ID", "gisaid_epi_isl")], by = c("SEQUENCE_ACCESSION" = "virus_name_clean")) %>%
mutate("SEQUENCE_EPI_ISL" = case_when(
# 'Accession ID' is populated and 'gisaid_epi_isl' is missing
(!is.na(`Accession ID`) & is.na(gisaid_epi_isl)) ~ `Accession ID`,
# 'Accession ID' is missing and 'gisaid_epi_isl' is populated
(is.na(`Accession ID`) & !is.na(gisaid_epi_isl)) ~ gisaid_epi_isl,
# 'Accession ID' is populated and 'gisaid_epi_isl' is populated and 'Accession ID' != 'gisaid_epi_isl'
((!is.na(`Accession ID`) & !is.na(gisaid_epi_isl)) & (`Accession ID` != gisaid_epi_isl)) ~ gisaid_epi_isl,
TRUE ~ gisaid_epi_isl
)) %>%
# mutate(SEQUENCE_EPI_ISL = "") %>%
mutate(SEQUENCE_ROSTER_PREPARE_DATE = format(today(), "%m/%d/%Y")) %>%
mutate(SEQUENCE_SPECIMEN_COLLECTION_DATE = str_trim(SEQUENCE_SPECIMEN_COLLECTION_DATE, side = "both")) %>%
# Exclude index when comparing rows in case there are 2 copies of a file
distinct(across(all_of(key_cols)), .keep_all = TRUE) %>%
# Remove extra columns
select(all_of(key_cols_index))Remove true duplicates and empty rows
Exact duplicates are removed at this point. If an upstream process is run twice and outputs multiple files, this will remove identical records. Empty rows are sometimes introduced after manually opening a csv file. These rows are removed, so they aren’t flagged in the QA checks.
In [11]:
# Creating copy of roster before adding filters and warning columns in case need to rerun anything in the script
Compiled_Roster_before_filters <- Compiled_Roster_Add_Cols
# Identify rows that are completely blank
empty_rows_in_roster <- Compiled_Roster_Add_Cols %>% filter_at(vars(key_cols), all_vars(is.na(.)))
# Remove empty rows
Compiled_Roster <- Compiled_Roster_Add_Cols %>% anti_join(empty_rows_in_roster)Change missing SEQUENCE_REASON to the appropriate reason by lab
In [12]:
Compiled_Roster <- Compiled_Roster %>%
mutate(SEQUENCE_REASON = case_when(
is.na(SEQUENCE_REASON) & str_detect(toupper(SEQUENCE_LAB),"CDC") ~ "SENTINEL SURVEILLANCE",
is.na(SEQUENCE_REASON) & str_detect(toupper(SEQUENCE_LAB),"LAURING LAB") ~ "OTHER",
is.na(SEQUENCE_REASON) & str_detect(toupper(SEQUENCE_LAB),"GRUBAUGH") ~ "OTHER",
is.na(SEQUENCE_REASON) & str_detect(toupper(SEQUENCE_LAB),"PHL") ~ NA_character_,
is.na(SEQUENCE_REASON) & str_detect(toupper(SEQUENCE_LAB),"KAISER") ~ "OTHER",
is.na(SEQUENCE_REASON) & !str_detect(toupper(SEQUENCE_LAB),"CDC|LAURING LAB|GRUBAUGH|PHL|KAISER") ~ "UNKNOWN",
TRUE ~ SEQUENCE_REASON
))First Quality Check
The roster_filters function is used across multiple scripts. In the roster compile script, roster is set to TRUE. This sets the QA_SCA_NA and the QA_COLLECT_DATE check to NA. If a file name includes “reviewed_roster_compiled”, it sets QA_SCA_WDRS_DUPE and QA_SCA_INT_DUPE to NA (This is a temporary fix to allow reviewed records to be rostered and should be removed after updates to checking SCA, SA, and CASE_ID in WDRS). It includes a few additional checks that are included in QA_OTHER. The QA_OTHER checks include:
- Is SEQUENCE_SGTF blank?
- Is SEQUENCE_SPECIMEN hardcoded to ‘YES’?
- Is SEQUENCE_DATE blank?
- Is SEQUENCE_LAB populated and a valid value within lab_variables.rds?
- Is SEQUENCE_REPOSITORY set to ‘GISAID’?
- Is SEQUENCE_ACCESSION in a valid format? (includes “USA/”, doesn’t include “hCoV”, ends in a valid year)
- Is SEQUENCE_REVIEWED blank?
- Is SEQUENCE_SPECIMEN_COLLECTION_DATE formatted correctly? (MM/DD/YYYY or M/D/YYYY)
- Is Case.Note “External data question package updated by COVID19…”
- Does CASE_ID only include numbers?
- Is SEQUENCE_VARIANT_OPEN_TEXT filled in when SEQUENCE_STATUS is ‘COMPLETE’?
- Is SEQUENCE_VARIANT_OPEN_TEXT blank when SEQUENCE_STATUS is ‘LOW QUALITY’ or ‘FAILED’?
The script will print a warning if a lab isn’t in the current list of lab names. The GitHub wiki has the current list of WDRS lab names. Update the lab_variables.RDS in write_lab_variables.R script as needed.
In [13]:
# Using the quality filters to send files to some of the for review folders
Compiled_Roster_qa <- roster_filters(Compiled_Roster, lab_vars, wdrs_sa_seq_values, wdrs_sca_seq_values, lineages$lineage_extracted, roster = TRUE)
if(any(!Compiled_Roster_qa$SEQUENCE_LAB %in% lab_vars$lab_names_wdrs)) {
warning(paste0("Please review list of sequence labs"))
}Save flagged records to For_Review folder
If a record has any flag, it is removed from the Compiled_Roster data. Flagged records that have a SEQUENCE_STATUS as “COMPLETE” or are from “PHL” are sent to the For_Review folder. Flagged records that are not from PHL and have a SEQUENCE_STATUS of “FAILED” or “LOW QUALITY” are dropped.
In [14]:
# Create a dataframe that is referenced in email message
review_dataframe <- data.frame(File = as.character(),
nrows_in_file = as.integer())
# Records with QA flags
roster_for_review <- filter(Compiled_Roster_qa, sum > 0)
# Drop FAILED and LOW QUALITY records from non-PHL submitters
# Remove extra columns
roster_for_review_output <- roster_for_review %>%
filter(SEQUENCE_LAB == "PHL" | !SEQUENCE_STATUS %in% c("FAILED", "LOW QUALITY")) %>%
select(-index, -sum)
# If there are records with QA flags, they need to be saved to the For_Review folder or dropped.
if(nrow(roster_for_review_output) > 0) {
# Save the number of records that need to be reviewed for use in email
review_dataframe <- review_dataframe %>% add_row(File = "Compiled_Roster_For_Review",
nrows_in_file = nrow(roster_for_review_output))
# Save csv file
roster_for_review_output %>%
write_csv(paste0("For_Review/to_process/",
"Compiled_Roster_For_Review_",
format(now(), "%Y-%m-%d-%H%M%S"), ".csv"), na = "")
}
# Remove all records from the roster that were sent to the for_review folder or dropped
Compiled_Roster <- Compiled_Roster %>%
anti_join(roster_for_review, by=key_cols) %>%
distinct()Final Roster
Final Quality Checks
Rerun the roster_filters. At this point, the Compiled_Roster dataframe should not have QA flags. If rows are removed at this point, check if they need to be sent to the For_Review folder. This code chunk can simplified/modified because the roster_filters function has already been used. It provides a nice quality_table that can be helpful as a visual check.
In [15]:
# Last check on records to be rostered
Compiled_Roster <- roster_filters(Compiled_Roster, lab_vars, wdrs_sa_seq_values, wdrs_sca_seq_values, lineages$lineage_extracted, roster = TRUE)
# Remove any records that have a flag and remove extra columns
Compiled_Roster_Clean <- Compiled_Roster %>%
filter(sum==0) %>%
select(all_of(key_cols))
if(nrow(Compiled_Roster_Clean) != nrow(Compiled_Roster)){
stop(paste0("Please review the number of rows in the Compiled_Roster_Clean."))
}Check number of columns
This will stop the script if the number of columns is incorrect. If the WDRS template changes, this number needs to be updated to reflect the change.
In [16]:
# Check number of columns
if (ncol(Compiled_Roster_Clean) != 17){
stop(paste0("Please review number of columns. There should be 17 columns for roster."))
}Prep file for WDRS
Split files
The maximum number of rows for WDRS is 500. This splits the records into lists of 500 records
In [17]:
# Number of rows in roster
number_row_clean_roster <- nrow(Compiled_Roster_Clean)
# Determine the number of files that will be needed
factor_for_roster_split <- rep(seq_len(ceiling(number_row_clean_roster / 500)),each = 500,length.out = number_row_clean_roster)
# Divide the data so there are a max of 500 records in a group
Compiled_Roster_Split <- split(Compiled_Roster_Clean, f = factor_for_roster_split)
# Print dimensions of the groups
lapply(Compiled_Roster_Split, dim)Save files for script runner’s review
This writes the csv files to today’s Archive folder. This is the last point to review the file(s) before sending it to Data Support. Once the records have been saved to the Data Support folder, the file(s) may be uploaded to WDRS quickly, and we shouldn’t make changes to them.
In [18]:
# Number of files to be saved
m <- length(Compiled_Roster_Split)
# Write compiled roster to new date-stamped folder in Archive
for (i in 1:m) {
write_csv(
Compiled_Roster_Split[[i]],
paste0("write_roster_here/Archive/",
today(),
"//Compiled_Roster_",
i,
"_",
str_replace(str_replace_all(format(now()), ":", "-"), " ", "_"),
".csv"
),
na = ""
)
}
# Stop script and review files in the Archive folder. I've been checking that the numbers of rows and columns are correct and look for really high levels of missingness in a column or anything that seems weird. Once the files have been saved to the Data Support folder, we shouldn't make changes to the files, so this is the last chance for reviewing.
stop(paste("Review files before saving to DS_path"))Save files for Data Support
Don’t run this until you have reviewed the files in the Archive folder! This saves the file(s) for Data Support.
In [19]:
# Read in the file path for Data Support folder
DS_path <- "WDRS/Rosters/RosterImports_to_Prod/Sequencing/"
# Save the file(s)
for (i in 1:m) {
write_csv(
Compiled_Roster_Split[[i]],
paste0(DS_path,
"Compiled_Roster_",
i,
"_",
str_replace(str_replace_all(format(now()), ":", "-"), " ", "_"),
".csv"
),
na = ""
)
}
# Confirm that the files have been saved for WDRS upload
stop(paste("Manually check that files have been saved to DS_path"))Send email
An email is sent out with the number of roster files produced and the number of records sent to the For_Review folder.
In [20]:
# Email listed as sender
email_from <- ""
# Email recipients
email_to <-
c(""
)
# Subject line
email_subj <-
"Sequencing - Genome Sequencing Roster Complete Automated Email"
# Number of roster files produced
if (m == 1) {
email_body <-
paste(
"The COVID-19 genome sequencing roster for",
format(today(), "%m/%d/%Y"),
"has been generated. "
)
} else {
email_body <- paste(
m,
" COVID-19 genome sequencing rosters for",
format(today(), "%m/%d/%Y"),
"have been generated."
)
}
# Number of records that need to be reviewed
if(nrow(review_dataframe) > 0) {
email_body_2 <- paste(
"
", (review_dataframe$nrows_in_file), "record(s) could not be added to the COVID-19 genome sequencing roster. ", " Please review the file located in the For Review/to_process folder."
)
} else {
email_body_2 <- ""
}
# Send email
sendmailR::sendmail(from = email_from,
to = email_to,
subject = email_subj,
msg = c(email_body, email_body_2),
headers= list("Reply-To" = email_from),
control = list(smtpServer = ""))